(54) Voice synthesizing apparatus

(57) A voice synthesizing apparatus comprises: means for storing phoneme pieces having a plurality of
different pitches for each phoneme represented by a same phoneme symbol; means for reading a phoneme
piece by using a pitch as an index; and a voice synthesizer that synthesizes a voice in accordance with the
read phoneme piece.

Description

CROSS REFERENCE TO RELATED APPLICATION 



[0001] This application is based on Japanese Patent Application No. 2001-067258, filed on March 9, 2001, the entire
contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

A) FIELD OF THE INVENTION

[0002] The present invention relates to a voice synthesizing apparatus, and more particularly to a voice synthesizing
apparatus for synthesizing human singing voice.

B) DESCRIPTION OF THE RELATED ART

[0003] Human voice consists of phones or phonemes, each of which consists of a plurality of formants. In synthesis of
human singing voice, first, all formants constituting each of all phonemes that a human can speak are generated to form
the necessary phones. Next, a plurality of generated phones are sequentially concatenated and pitches are controlled in
[...]. This synthesizing method is usable not only for human voices but also for sounds generated by a musical
instrument such as a wind instrument.

[0004] A voice synthesizing apparatus utilizing this method is already known. For example, Japanese Patent No.
[...] sound having even a high [...].

[0005] It is known that the formant frequency depends upon a pitch. As disclosed in JP-A-HEI-6-308997, a database
[...] each pitch is used to select proper phoneme pieces in accordance with the voice pitch.

With such a conventional database, in which each phoneme consists of several phoneme pieces that
have different pitches, the size of the database becomes relatively large.

[0008] Furthermore, since the formant frequency depends not only upon the pitch but also upon
other parameters such as dynamics, the data amount increases in the unit of square and cube.

SUMMARY OF THE INVENTION 



[0009] It is an object of the present invention to provide a voice synthesizing apparatus capable of reducing the size
of a database while deterioration of the sound quality is minimized.

[0010] It is another object of the invention to provide a voice synthesizing apparatus using such a database.

[0011] According to one aspect of the present invention, there is provided a voice synthesizing apparatus, comprising:
means for storing phoneme pieces having a plurality of different pitches for each phoneme represented by a same
phoneme symbol; means for reading a phoneme piece by using a pitch as an index; and a voice synthesizer that
synthesizes a voice in accordance with the read phoneme piece.

[0012] According to another aspect of the present invention, there is provided a voice synthesizing apparatus,
comprising: [...] phoneme symbol; means for reading a phoneme piece by using the musical expression as an
index; and means for synthesizing a voice in accordance with the read phoneme piece.

[0013] According to a further aspect of the present invention, there is provided a voice synthesizing apparatus,
comprising: means for storing a plurality of different phoneme pieces for each phoneme represented by a same phoneme
symbol; means for inputting voice information for voice synthesis; means for calculating a phoneme piece matching
the voice information by interpolation using the phoneme pieces stored in said means for storing, if the phoneme piece
matching the voice information is not stored in said means for storing; and means for synthesizing a voice in accordance
with the phoneme piece calculated through interpolation.

[0014] According to a still further aspect of the present invention, there is provided a voice synthesizing apparatus,
comprising: means for storing a change amount of a voice feature parameter as template data; means for inputting
voice information for voice synthesis; means for reading the template data from said memory in accordance with the
voice information; and means for synthesizing a voice in accordance with the read template data [...].

[0015] As above, it is possible to provide a voice synthesizing database with a reduced size while deterioration of
the voice quality is minimized.

[0016] It is also possible to provide a voice synthesizing apparatus capable of synthesizing more realistic human 
voices of a song and singing the song in a state without unnaturalness. 

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] Fig. 1 is a block diagram showing the structure of a voice synthesizing apparatus 1 according to an embod- 
iment of the invention. 

[0018] Fig. 2 is a conceptual diagram showing an example of input data Score. 

[0019] Fig. 3 is a diagram showing an example of a Timbre database TDB.

[0020] Fig. 4 is a diagram showing another example of a Timbre database TDB. 

[0021] Fig. 5 is a diagram showing an example of a stationary template database. 

[0022] Fig. 6 is a diagram showing an example of an articulation template database. 

[0023] Fig. 7 is a diagram showing an example of an NA template database NADB. 

[0024] Fig. 8 is a diagram showing an example of an NN template database NNDB.

[0025] Fig. 9 is a flow chart illustrating a feature parameter generating process. 

[0026] Figs. 10A to 10C are graphs showing examples of dynamics functions.

[0027] Fig. 11 is a graph showing an example of an opening function. 

[0028] Fig. 12 is a diagram illustrating an example of a first application of templates according to the embodiment. 

[0029] Fig. 13 is a diagram illustrating a modification of the first application of templates according to the embodiment.

[0030] Fig. 14 is a diagram illustrating an example of a second application of templates according to the embodiment.

[0031] Fig. 15 is a diagram illustrating an example of a third application of templates according to the embodiment.

DESCRIPTION OF THE PREFERRED EMBODIMENTS



[0032] Fig. 1 is a block diagram showing the structure of a voice synthesizing apparatus 1.

[0033] The voice synthesizing apparatus 1 has a data input unit 2, a feature parameter generating unit 3, a database 
4 and an EpR voice synthesizing engine 5. 

[0034] Input data Score input to the data input unit 2 is sent to the feature parameter generating unit 3 and EpR voice
synthesizing engine 5. In accordance with the input data Score, the feature parameter generating unit 3 reads feature
parameters and various templates to be described later from the database 4. The feature parameter generating unit
3 applies various templates to the read feature parameters to generate final feature parameters and sends them to the
EpR voice synthesizing engine 5.

[0035] The EpR voice synthesizing engine 5 generates pulses in accordance with the pitches, dynamics and the like of
the input data Score, and applies feature parameters to the generated pulses to synthesize and output voices.

[0036] Fig. 2 is a conceptual diagram showing an example of the input data Score. The input data Score is constituted 
of a phoneme track PHT, a note track NT, a pitch track PIT, a dynamics track DYT, and an opening track OT. The input 
data Score is song data of song phrases or the whole song, and changes with time. 

[0037] The phoneme track PHT includes phoneme names and their voice production continuation times. Each phoneme
is classified into two parts: Articulation representative of a transition part between phonemes; and Stationary
representative of a stationary part. Each phoneme includes flags for distinguishing between Articulation and Stationary.
Since Articulation is the transition part, it has two phoneme names, namely preceding and succeeding phoneme names.
Since Stationary is the stationary part, it has only one phoneme name.

[0038] The note track NT records flags each indicating one of a note attack (NoteAttack), a note-to-note (NoteToNote) 
and a note release (NoteRelease). NoteAttack, NoteToNote and NoteRelease are commands for designating musical 
expression at the rising (attack) time of voice production, at the pitch change time, and at the falling (release) time of 
voice production, respectively. 

[0039] The pitch track PIT records the fundamental frequency at each timing of a voice to be vocalized. The pitch of
an actually generated sound is calculated in accordance with pitch information recorded in the pitch track PIT and other
information. Therefore, the pitch of an actually produced sound may differ from the pitch recorded in this pitch track PIT.

[0040] The dynamics track DYT records a dynamics value at each timing, which value is a parameter indicating an
intensity of voice. The dynamics value takes a value from 0 to 1.

[0041] The opening track OT records an opening value at each timing, which value is a parameter indicating the
opening degree of lips (lip opening degree). The opening value takes a value from 0 to 1.
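
As an illustration of the track layout described in paragraphs [0036] to [0041], the values of the input data Score at one timing could be held in a record such as the following. This is only a sketch: the class name `ScoreFrame`, the field names, the use of Hz for the pitch, and the use of `None` for "no note flag" are assumptions made for the example and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ScoreFrame:
    """Values of the five tracks of Score at one timing (hypothetical layout)."""
    phonemes: Tuple[str, ...]   # one name for Stationary, (preceding, succeeding) for Articulation
    is_articulation: bool       # flag distinguishing Articulation from Stationary
    note_flag: Optional[str]    # "NoteAttack", "NoteToNote", "NoteRelease", or None (assumed)
    pitch_hz: float             # fundamental frequency from the pitch track PIT
    dynamics: float             # dynamics track DYT, 0 to 1
    opening: float              # opening track OT, 0 to 1

    def __post_init__(self) -> None:
        expected = 2 if self.is_articulation else 1
        if len(self.phonemes) != expected:
            raise ValueError(f"expected {expected} phoneme name(s)")
        for label, value in (("dynamics", self.dynamics), ("opening", self.opening)):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{label} must lie in [0, 1]")


# A stationary "a" and an articulation from "a" to "i" (illustrative values only).
stationary = ScoreFrame(("a",), False, None, 261.6, 0.5, 0.8)
transition = ScoreFrame(("a", "i"), True, "NoteToNote", 261.6, 0.5, 0.6)
```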
[0042] In accordance with the input data Score input from the data input unit 2, the feature parameter generating
unit 3 reads data from the database 4, and as will be later described, generates feature parameters in accordance with
the input data Score and the data read from the database 4, and outputs the feature parameters to the EpR voice
synthesizing engine 5.

[0043] The feature parameters to be generated by the feature parameter generating unit 3 can be classified for
example into four types: an envelope of excitation waveform spectra; excitation resonances; formants; and differential
spectra. These four feature parameters can be obtained by resolving a spectrum envelope (original spectrum envelope)
of harmonic components obtained by analyzing voices (original voices) of a person or the like.
[0044] The envelope (ExcitationCurve) of excitation waveform spectra is constituted of three parameters: EGain
indicating an amplitude (dB) of a glottal waveform; ESlope indicating a slope of the spectrum envelope of the
glottal waveform; and ESlopeDepth indicating a depth (dB) from a maximum value to a minimum value of the spectrum
envelope of the glottal waveform. ExcitationCurve can be expressed by the following equation (A):

ExcitationCurve(f) = EGain + ESlopeDepth * (exp(-ESlope * f) - 1)    (A)
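
Equation (A) can be checked numerically with a few lines of code. The sketch below simply evaluates the formula; the function name and the parameter values are hypothetical and chosen only to show the two limits of the curve: at f = 0 the value equals EGain, and for large f it approaches EGain minus ESlopeDepth.

```python
import math


def excitation_curve(f: float, e_gain: float, e_slope: float, e_slope_depth: float) -> float:
    """Evaluate equation (A) at frequency f."""
    return e_gain + e_slope_depth * (math.exp(-e_slope * f) - 1.0)


# Hypothetical parameter values, for illustration only.
E_GAIN, E_SLOPE, E_SLOPE_DEPTH = -10.0, 0.001, 40.0

level_at_zero = excitation_curve(0.0, E_GAIN, E_SLOPE, E_SLOPE_DEPTH)    # equals EGain
level_far_out = excitation_curve(1.0e6, E_GAIN, E_SLOPE, E_SLOPE_DEPTH)  # approaches EGain - ESlopeDepth
```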



[0045] The excitation resonance is a chest resonance. The excitation resonance is constituted of three parameters
including a center frequency (ERFreq), a bandwidth (ERBW) and an amplitude (ERAmp), and has the second-order
filter characteristics.

[0046] The formant indicates a vocal tract resonance made of twelve resonances. The formant is constituted of three
parameters including a center frequency (FormantFreqi), a bandwidth (FormantBWi) and an amplitude (FormantAmpi),
where "i" takes a value from 1 to 12 (1 ≤ i ≤ 12).

[0047] The differential spectrum is a feature parameter that has a differential spectrum from the original spectrum,
the differential spectrum being unable to be expressed by the three parameters: the envelope of excitation waveform
spectra, excitation resonances and formants.
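
The four kinds of feature parameters listed in paragraphs [0043] to [0047] could be grouped as follows. This is a minimal sketch under assumptions: the class and field names are invented for the example, and the representation of the differential spectrum as a plain list of values is a guess, since the specification does not state its storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resonance:
    freq: float  # center frequency (ERFreq or FormantFreqi)
    bw: float    # bandwidth (ERBW or FormantBWi)
    amp: float   # amplitude (ERAmp or FormantAmpi)


@dataclass
class FeatureParams:
    e_gain: float                    # EGain
    e_slope: float                   # ESlope
    e_slope_depth: float             # ESlopeDepth
    excitation_resonance: Resonance  # chest resonance
    formants: List[Resonance]        # twelve vocal tract resonances, i = 1..12
    differential_spectrum: List[float] = field(default_factory=list)  # assumed layout

    def __post_init__(self) -> None:
        if len(self.formants) != 12:
            raise ValueError("the formant parameter is made of twelve resonances")


# Illustrative values only.
params = FeatureParams(
    e_gain=-10.0,
    e_slope=0.001,
    e_slope_depth=40.0,
    excitation_resonance=Resonance(freq=500.0, bw=100.0, amp=3.0),
    formants=[Resonance(freq=500.0 * i, bw=80.0, amp=1.0) for i in range(1, 13)],
)
```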

[0048] The database 4 is constituted of, at least a Timbre database TDB, a phoneme template database PDB and 
a note template database NDB. 

[0049] In general, if voices are synthesized by using only feature parameters at a specific timing stored in the Timbre
database TDB, the synthesized voices become very monotonous and mechanical. If phonemes are continuously
generated, voices in the transition part between phonemes change gradually in the actual case. Therefore, if the stationary
parts of phonemes are simply concatenated, a very unnatural voice is produced at the concatenated point. These
disadvantages can be mitigated by voice synthesis using the phoneme template and note template.
[0050] Timbre is a tone color of a phoneme and is expressed by feature parameters at one timing point (a set of the
excitation curve, excitation resonance, formant and differential spectrum). Fig. 3 shows an example of the Timbre
database TDB. This database has a phoneme name and a pitch as its indices.

[0051] Although the Timbre database TDB shown in Fig. 3 is used in this embodiment, a database having four indices
including the phoneme name, pitch, dynamics and opening such as shown in Fig. 4 may be used.

[0052] The phoneme template database PDB is constituted of a stationary template database and an articulation
template database. The template is a set of a sequence having pairs of a feature parameter P and a pitch Pitch
at a fixed time interval, and a length T (sec) of the sequence. The template can be expressed by
the following equation (B):



Template = {P(t), Pitch(t), T}    (B)

where t = 0, Δt, 2Δt, 3Δt, ..., T. In this embodiment, Δt is 5 ms.

[0053] As Δt is made short, although the sound quality becomes good because of a high time resolution, the size of
the database becomes large. Conversely, as Δt is made long, although the sound quality becomes bad, the size of the
database becomes small. When Δt is determined, the priority order of the sound quality and the database size is
taken into consideration.
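
The trade-off between the sampling interval and the database size follows directly from the sampling grid t = 0, Δt, 2Δt, ..., T: the number of stored (P, Pitch) pairs per template is T/Δt + 1. The helper below is a small sketch of that arithmetic; its name and the one-second template length are assumptions used only for illustration.

```python
def frame_count(length_s: float, dt_s: float = 0.005) -> int:
    """Number of sample points t = 0, dt, 2*dt, ..., T in a template of length T seconds."""
    return int(round(length_s / dt_s)) + 1


frames_at_5ms = frame_count(1.0)          # embodiment value of 5 ms
frames_at_10ms = frame_count(1.0, 0.010)  # coarser grid: about half the data
```

Doubling the interval roughly halves the number of stored pairs, at the cost of time resolution.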

[0054] Fig. 5 shows an example of the stationary template database. The stationary template database uses a phoneme
name and a representative pitch as its indices, and has stationary templates of all phonemes of voiced sounds,
created by analyzing voices having stable phonemes and pitches by utilizing an EpR model.

[0055] If one voice of a voiced sound, e.g., "a", is produced during a prolonged period at some pitch, e.g., at C4,
the formant frequencies are generally constant and stationary.
However, there is some fluctuation in an actual case. If this fluctuation does not exist and the feature parameters are
perfectly constant, synthesized voices are flat and mechanical. In other words, this fluctuation expresses the
individuality and naturalness of each person.

[0056] When a voice of a voiced sound is synthesized, not only Timbre, i.e., the feature parameters at one timing,
is used, but fluctuation of feature parameters and pitches derived from voices of an actual person and
stored in the stationary templates is added to it, which gives the voice of a voiced sound naturalness.

[0057] In synthesizing voices of a song, it is necessary to change a sound production time with the length of each
note. However, only a single long template is prepared. If we synthesize a voiced sound longer than the template, this
template is directly applied starting from the leading part of the voice of a voiced sound without stretching or shrinking
the time axis of the template.

[0058] If the voice reaches the end of the template, the same template is again applied from the time point. If the
voice reaches the end of the template, a template with a reversed time axis may be applied. With this method,
discontinuity at the connection point between the templates does not exist.
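
The two repetition strategies of paragraph [0058] can be expressed as a mapping from an output frame number to a template frame number. The following is a sketch only; the function name and the frame-index formulation are assumptions, not taken from the specification. With reversal, the template is read forward and then backward, so neighboring output frames always use neighboring template frames.

```python
def template_index(n: int, num_frames: int, reverse_on_repeat: bool = True) -> int:
    """Map output frame n to a template frame without stretching the time axis."""
    if num_frames < 2:
        return 0
    if not reverse_on_repeat:
        return n % num_frames            # restart from the beginning of the template
    period = 2 * (num_frames - 1)        # forward pass followed by backward pass
    k = n % period
    return k if k < num_frames else period - k


forward_only = [template_index(n, 4, reverse_on_repeat=False) for n in range(8)]
with_reversal = [template_index(n, 4) for n in range(8)]
```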

[0059] If the time axis of the template is stretched or shortened, the speed of a change in the feature parameters
and pitches changes greatly and the naturalness is degraded. It is preferable not to change the time axis of the template,
also from the viewpoint that a human being does not consciously control the fluctuation in the stationary part.

[0060] The stationary template does not have the time series of feature parameters themselves in the stationary
part, but it has representative typical feature parameters of each phoneme and change amounts of the feature
parameters. The change amounts of the feature parameters in the stationary part are small. Therefore, as compared to having
feature parameters themselves, having the change amounts reduces the information amount so that the size of the
database can be made small.

[0061] Fig. 6 shows an example of the articulation template database. The articulation template database uses a 
preceding phoneme name, a succeeding phoneme name, and a representative pitch as its indices. In the articulation 
template database, the articulation template has combinations of phonemes of a language which phonemes can be 
actually realized. 

[0062] The articulation template can be obtained by analyzing voices of phonemes in the concatenated part with a 
stable pitch by utilizing an EpR model. 

[0063] The feature parameter P(t) may be either an absolute value or a differential value. As will be later described, 
the absolute values of these templates are not directly used for voice synthesis, but the relative change amounts of 
parameters are used. Therefore, in accordance with the template application method, the feature parameters are re- 
corded in the form of a difference from P(t = T), a difference from P(0), or a difference from a straight line interconnecting 
P(0) and P(T) as shown in the following equations (C1 to C3): 



Template1 = {P(t) - P(T), Pitch(t) - Pitch(T), T}    (C1)

Template2 = {P(t) - P(0), Pitch(t) - Pitch(0), T}    (C2)

Template3 = {P(t) - ((P(T) - P(0)) * t/T + P(0)), Pitch(t) - ((Pitch(T) - Pitch(0)) * t/T + Pitch(0)), T}    (C3)
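
The three difference forms can be computed from a uniformly sampled absolute-value sequence as sketched below. The function name and the string labels "C1", "C2" and "C3" are assumptions for the example; because the samples are equally spaced, the ratio t/T is taken as the frame index divided by the last index.

```python
from typing import List


def to_differential(p: List[float], form: str) -> List[float]:
    """Convert absolute samples P(0..T) into one of the difference forms (C1) to (C3)."""
    last = len(p) - 1
    if form == "C1":                      # difference from the end point P(T)
        return [v - p[-1] for v in p]
    if form == "C2":                      # difference from the start point P(0)
        return [v - p[0] for v in p]
    if form == "C3":                      # difference from the line joining P(0) and P(T)
        if last == 0:
            return [0.0]
        return [v - ((p[-1] - p[0]) * i / last + p[0]) for i, v in enumerate(p)]
    raise ValueError(f"unknown form: {form}")


samples = [100.0, 104.0, 103.0, 106.0]   # illustrative values only
c1 = to_differential(samples, "C1")
c2 = to_differential(samples, "C2")
c3 = to_differential(samples, "C3")
```

Form (C1) is zero at the end point, form (C2) is zero at the start point, and form (C3) is zero at both.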

[0064] When a person utters two phonemes continuously, the voices do not change abruptly, but utterance of the
voices changes gradually. For example, if after a vowel "a" is pronounced, a vowel "e" is pronounced continuously
without any pause, the vowel "a" is first produced, and a voice intermediate of "a" and "e" is generated to change to "e".

[0065] This phenomenon is generally called co-articulation. In order to synthesize naturally concatenated
phonemes, it is preferable to provide voice information in the concatenated part in some desired form for each of
the combinations of phonemes of a language which phonemes can be actually realized.

[0066] It is already known that the concatenating part between phonemes is provided in the form of LPC coefficients
and speech waveforms. In this embodiment, the articulation part between two phonemes is synthesized by using an
articulation template having differential information of feature parameters and pitches.

[0067] For example, consider the case wherein a song having two continuous words "a" and "i" of a quarter note at
the same pitch is synthesized. There is a transition part from "a" to "i" in the boundary area between two notes. Both
"a" and "i" are vowels and a voiced sound. This transition part corresponds to an articulation from V (voiced sound) to
V (voiced sound). In this case, the feature parameters in the transition part can be obtained by applying the articulation
template by using a method of Type 2 to be described later.

[0068] Namely, the feature parameters of "a" and "i" are read from the Timbre database TDB and the articulation
template from "a" to "i" is applied to the feature parameters. In this manner, the feature parameters having a natural
change of the transition part can be obtained.

[0069] If the time of the transition part from "a" to "i" is set to the original time of the articulation template to be applied
to the transition part, the same change as that of voice waveforms used when the template was formed can be obtained.

[0070] In synthesizing a voice changing slower or longer than the template time, after the length of the template is
linearly stretched, a difference of feature parameters is added. Unlike the stationary part described earlier,
the speed of a change part between two phonemes can be controlled consciously, so even if the template is linearly
stretched, naturalness is not damaged greatly.

[0071] Next, consider the case wherein a song having two continuous words "a" and "su" of a quarter note at the
same pitch is synthesized. There is a short transition part from "a" to the consonant of "su", that is "s", in the boundary
area between two notes. This transition part corresponds to an articulation from V (voiced sound) to U (unvoiced sound).
In this case the feature parameters in the transition part can be obtained by applying the articulation template by using
a method of Type 1 to be described later.

[0072] Feature parameters of "a" are read from the Timbre database TDB and an articulation template from "a" to
"s" is applied to the read feature parameters. In this manner, the feature parameters having a natural change of the
transition part can be obtained.
[0073] The reason why Type 1, i.e., a difference from the start part of the template, is used for the articulation from
V (voiced sound) to U (unvoiced sound) is simply because pitches and feature parameters do not exist in U (unvoiced
sound) corresponding to the end part.

[0074] "su" is constituted of a consonant "s" and a vowel "u". A transition part also exists in the boundary area where
"u" is pronounced while keeping the sound "s". This articulation part corresponds to the articulation from U to V so that
the articulation template is applied by using the method of Type 1.

[0075] Feature parameters of "u" are read from the Timbre database TDB and an articulation template from "s" to
"u" is applied to the feature parameters to obtain the feature parameters of the transition part from "s" to "u".

[0076] The articulation template having differential information of feature parameters is advantageous in that the
data size becomes smaller than the template having absolute value feature parameters.

[0077] The note template database NDB has at least a note attack template (NA template) database NADB, a note
release template (NR template) database NRDB, and a note-to-note template (NN template) database NNDB.

[0078] Fig. 7 shows an example of the NA template database NADB. The NA template has information of feature
parameters and pitches in the voice rising part.

[0079] The NA template database NADB stores NA templates for phonemes of all voiced sounds by using a phoneme
name and a representative pitch as indices. The NA template is obtained by analyzing actually produced voices in the
rising part.

[0080] The NR template has information of the feature parameters and pitches in the voice falling part. The NR
template database NRDB has the same structure as that of the NA template database NADB, and has NR templates
for phonemes of all voiced sounds by using a phoneme name and a representative pitch as indices.

[0081] As the rising part (Attack) of a phoneme vocalized at a certain pitch, e.g., "a", is analyzed, it can be seen that
the amplitude becomes gradually large and stabilizes when it takes a certain level. Not only the amplitude value but
also the formant frequency, formant bandwidth and pitch change.

[0082] If the NA template obtained by analyzing the rising part of an actual human voice, e.g., "a", is applied to the
feature parameters of the stationary part, a natural change in human voice in the rising part can be given.

[0083] If NA templates for all phonemes are prepared, it is possible to give a change in every phoneme to the attack
part.

[0084] A song is sung by making the rising speed up and down in order to give particular musical expression. Although
the NA template has one rising time, the speed in the rising part of the NA template can be increased or decreased
by linearly expanding or contracting the time axis of the template.

[0085] It is known from experiments that unnaturalness of the attack part does not occur if the expansion/contraction
ratio is within a range of several times. In order to perform voice synthesis by designating the length of the
attack part over a wider range, NA templates having lengths at several levels may be prepared and the template having
the length nearest to the attack part is selected and expanded or contracted. Other methods may also be used.
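
The selection and stretching described in paragraphs [0084] and [0085] might look like the sketch below. Both function names are hypothetical, and linear interpolation between neighboring samples is an assumption about how a linearly expanded or contracted time axis would be resampled.

```python
from typing import List, Tuple


def choose_attack_template(lengths_s: List[float], target_s: float) -> Tuple[float, float]:
    """Pick the prepared template length nearest to the target and return (length, stretch ratio)."""
    nearest = min(lengths_s, key=lambda length: abs(length - target_s))
    return nearest, target_s / nearest


def stretch(values: List[float], new_count: int) -> List[float]:
    """Linearly expand or contract the time axis of a sampled template (new_count >= 2)."""
    last = len(values) - 1
    out = []
    for k in range(new_count):
        pos = k * last / (new_count - 1)   # position on the original time axis
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        out.append(values[lo] + (values[hi] - values[lo]) * (pos - lo))
    return out


nearest_length, ratio = choose_attack_template([0.05, 0.1, 0.2], 0.12)
stretched = stretch([0.0, 1.0, 0.0], 5)
```

Choosing the nearest prepared length keeps the stretch ratio close to 1, which is the condition under which the attack part is said to remain natural.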
[0086] Similar to the rising (Attack) part, the amplitudes, pitches and formants change in the end part of an utterance,
i.e., the falling (Release) part.

[0087] In order to give a natural change of human voices to the falling part, an NR template obtained by analyzing
actual human voices is applied to the feature parameters of a phoneme just before the start of the
falling part.

[0088] Fig. 8 shows an example of the NN template database NNDB. The NN template has the feature parameters
of voices in the pitch changing part. The NN template database NNDB stores NN templates for all phonemes of voiced
sounds by using as indices a phoneme name, a pitch at the start timing of the template and a pitch at the end timing
of the template.

[0089] There is a singing method of continuously singing two notes having different pitches without any pause by
changing the pitch of the preceding note to the pitch of the succeeding note. Although it is obvious that the
pitch and amplitude change, the voice frequency characteristics such as the formant frequency also change finely even
if pronunciation of the preceding and succeeding two notes are the same (e.g., the same "a").

[0090] By using the NN template obtained by analyzing a change in actual human voices by changing the pitch from 
the start point to end point, natural musical expression can be given in the boundary area between notes having different 
pitches. 

[0091] In an actual musical melody, there are many combinations of pitch changes even in the compass of 2 octaves
or 24 semi-tones. However, even if the absolute values of pitches are different, a template having a small pitch difference
can be used as a substitute so that NN templates for all pitch change combinations are not required to be prepared.

[0092] As will be later described, in selecting the NN template, a template having a small pitch change width is
selected with a priority over a template having a small pitch absolute value difference. The selected NN template is
applied by using a method of Type 3 to be later described.

[0093] The reason why the NN template having the small pitch change width is selected is as follows. There is a 
possibility that the NN template obtained from the part where the pitch changes greatly has big values. If this NN 
template is applied to the part where the pitch change width is small, the change shape of the original NN template 
cannot be retained and there is a possibility that the change becomes unnatural. 

[0094] An NN template obtained from a voice of a particular phoneme, e.g., "a", whose pitch changes may be used
for the pitch change of all phonemes. However, in the environment that a large data size poses no problem, it is
preferable to prepare NN templates for pitch changes of several patterns of each phoneme in order to generate
synthesized sounds that are not monotonous and are rich in expression.

[0095] Next, the method of applying each template stored in the database 4 will be described. In applying a template
to some section of the input data Score, the time axis of the template is stretched or shortened and a difference from
a feature parameter of the template is added to one or a plurality of feature parameters at the reference point to obtain
a train of feature parameters and pitches having the time length same as that of the section of Score. There are four
template applying methods Type 1 to Type 4. In the following description, a template is expressed by {P(t), Pitch(t), T}.
[0096] First, the template applying method of Type 1 will be described. Type 1 is the template applying method that
uses a start point. Applying the template applying method of Type 1 for a section K of the input data Score having a
length T' means calculating the feature parameter P't at the time t by the following equation (D):

P't = Pt + P(t * T/T') - P(0)    (D)

where Pt is a set of feature parameters in the section K at the time t, and T is the length of the template.

[0097] It is assumed that the start point of the template and section K is at the time t = 0. The equation (D) means 
that a change amount from the start point of the template is added to the feature parameter at the time t. 
[0098] Type 1 is used mainly when the template is applied to the feature parameter in the note release part. The 
reason for this is as follows. A voice in the stationary part exists in the start portion of the note release so that it is 
necessary to maintain the parameter continuity, i.e., voice continuity in the start portion of the note release, whereas 
no voice exists in the end portion of the note release so that it is not necessary to maintain the parameter continuity. 
[0099] Next, the template applying method of Type 2 will be described. Type 2 is the template applying method that
uses an end point. Applying the template applying method of Type 2 for a section K of the input data Score having a
length T' means calculating the feature parameter P't at the time t by the following equation (E):

P't = Pt + P(t * T/T') - P(T)    (E)

where Pt is a set of the feature parameters in the section K at the time t, and T is the length of the template.

[0100] It is assumed that the start point of the template and section K is at the time t = 0. The equation (E) means 
that a change amount from the end point of the template is added to the feature parameter at the time t. 
[0101] Type 2 is used mainly when the template is applied to the feature parameter in the note attack part. The 
reason for this is as follows. A voice in the stationary part exists in the end portion of the note attack so that it is 
necessary to maintain the parameter continuity, i.e., voice continuity in the end portion of the note attack, whereas no 
voice exists in the start portion of the note attack so that it is not necessary to maintain the parameter continuity.
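
The Type 1 and Type 2 methods differ only in which end of the template is taken as the zero reference. The sketch below works on frame indices instead of seconds and uses linear interpolation to read the stretched template; both choices, and all function names, are assumptions made for the example. Each section must contain at least two frames.

```python
from typing import List


def sample_at(values: List[float], pos: float) -> float:
    """Read a sampled template at a fractional frame position by linear interpolation."""
    lo = min(int(pos), len(values) - 1)
    hi = min(lo + 1, len(values) - 1)
    return values[lo] + (values[hi] - values[lo]) * (pos - lo)


def apply_type1(section: List[float], template: List[float]) -> List[float]:
    """Start-point method: add the change from the template start to each section value."""
    m, n = len(section) - 1, len(template) - 1
    return [p + sample_at(template, k * n / m) - template[0] for k, p in enumerate(section)]


def apply_type2(section: List[float], template: List[float]) -> List[float]:
    """End-point method: add the change relative to the template end to each section value."""
    m, n = len(section) - 1, len(template) - 1
    return [p + sample_at(template, k * n / m) - template[-1] for k, p in enumerate(section)]


section = [5.0, 5.0, 5.0, 5.0, 5.0]   # illustrative constant feature parameter
template = [0.0, 2.0, 1.0]            # illustrative template, stretched to five frames
type1_result = apply_type1(section, template)
type2_result = apply_type2(section, template)
```

The Type 1 result keeps the original value at the first frame, and the Type 2 result keeps it at the last frame, which matches the continuity requirements stated for note release and note attack.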
[0102] Next, the template applying method of Type 3 will be described. Type 3 is the template applying method that
uses both the start and end points. Applying the template applying method of Type 3 for a section K of the input data
Score having a length T' means calculating the feature parameter P't at the time t by the following equation (F):

P't = P0 + (t/T') * (PT' - P0) + (P(t * T/T') - P(0) - (t/T') * (P(T) - P(0)))    (F)

where Pt is a set of the feature parameters in the section K at the time t, P0 and PT' being its values at the start
and end points of the section K, and T is the length of the template.

[0103] It is assumed that the start point of the template and section K is at the time t = 0. The equation (F) means 
that a difference from the straight line interconnecting the start and end points of the template is added to the straight 
line interconnecting the start and end points of the section K. 
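
The Type 3 method follows from the description above: the deviation of the template from its own start-to-end straight line is added to the straight line joining the start and end values of the section. The following sketch again uses frame indices and linear interpolation, which are assumptions, as are the function names.

```python
from typing import List


def sample_at(values: List[float], pos: float) -> float:
    """Read a sampled template at a fractional frame position by linear interpolation."""
    lo = min(int(pos), len(values) - 1)
    hi = min(lo + 1, len(values) - 1)
    return values[lo] + (values[hi] - values[lo]) * (pos - lo)


def apply_type3(p_start: float, p_end: float, num_frames: int, template: List[float]) -> List[float]:
    """Both-ends method for a section of num_frames frames (num_frames >= 2)."""
    m, n = num_frames - 1, len(template) - 1
    out = []
    for k in range(num_frames):
        r = k / m                                                   # relative position t/T'
        section_line = p_start + r * (p_end - p_start)
        template_line = template[0] + r * (template[-1] - template[0])
        out.append(section_line + sample_at(template, k * n / m) - template_line)
    return out


type3_result = apply_type3(10.0, 14.0, 5, [0.0, 3.0, 2.0])   # illustrative values only
```

Because the deviation is zero at both ends of the template, the result equals the section values at both the start and the end points.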

[0104] Next, the template applying method of Type 4 will be described. Type 4 is the template applying method that
uses a stationary type. Applying the template applying method of Type 4 for a section K of the input data Score having
a length T' means calculating the feature parameter P't at the time t by the following equation (G):

P't = Pt + P(t mod T) - P(0)    (G)

where Pt is a set of the feature parameters in the section K at the time t, and T is the length of the template.

[0105] It is assumed that the start point of the template and section K is at the time t = 0. The equation (G) means
that a change amount from the start point of the template is added to the section K repetitively at every T.
[0106] Type 4 is used mainly when the template is applied to the stationary part. Type 4 gives natural fluctuation to 
the relatively long stationary part of a voice. 

[0107] Fig. 9 is a flow chart illustrating a feature parameter generating process. This process generates feature
parameters at the time t. The feature parameter generating process repeats at a predetermined time interval, increasing
the time t, to synthesize whole voices in the phrase or song.

[0108] At Step SA1 the feature parameter generating process starts to thereafter advance to the next Step SA2.
[0109] At Step SA2 values of each track of the input data Score at the time t are acquired. Specifically, of the input
data Score at the time t, the phoneme name, distinguishment between articulation and stationary, distinguishment
between note attack, note-to-note and note release, a pitch, a dynamics value and an opening value are acquired.
Thereafter, the flow advances to the next Step SA3.

[0110] At Step SA3, in accordance with the value of each track of the input data Score acquired at Step SA2, necessary
templates are read from the phoneme template database PDB and note template database NDB. Thereafter, the flow
advances to the next Step SA4.

[0111] Reading the phoneme template at Step SA3 is performed, for example, by the following procedure. If it is
judged that the phoneme at the time t is articulation, the articulation template database is searched to read a template
having the coincident preceding and succeeding phoneme names and the nearest pitch.

[0112] If it is judged that the phoneme at the time t is stationary, the stationary template database is searched to 
read a template having the coincident phoneme name and the nearest pitch. 

[0113] Reading the note template is performed by the following procedure. If it is judged that the note track at the
time t is note attack, the NA template database is searched to read a template having the coincident phoneme
name and the nearest pitch.

[0114] If it is judged that the note track at the time t is note release, the NR template database NRDB is searched 
to read a template having the coincident phoneme name and the nearest pitch. 

[0115] If it is judged that the note track at the time t is note-to-note, the NN template database NNDB is searched to
read a template having the coincident phoneme names and the nearest distance d. The distance d is calculated by
the following equation (H) by using the start pitches and end pitches. The equation (H) uses as a distance scale the
value obtained by adding a weighted change amount of frequencies and a weighted change amount of average values:

$$d = 0.8\,\lvert \mathit{TempInterval} - \mathit{Interval} \rvert + 0.2\,\lvert \mathit{TempAve} - \mathit{Ave} \rvert \tag{H}$$

where

TempInterval = |template start point pitch - template end point pitch|,
TempAve = (template start point pitch + template end point pitch)/2,
Interval = |note track start point pitch - note track end point pitch|, and
Ave = (note track start point pitch + note track end point pitch)/2.



[0116] By reading the template in accordance with the distance d calculated by the equation (H), the template having
the nearest pitch change amount rather than the nearest pitch absolute value can be read.
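
The selection by the distance d of equation (H) can be illustrated with a short sketch. The dictionary layout of a template (`name`, `start`, `end`, with pitches in cents) and the function names are assumptions made only for this example.

```python
def nn_distance(temp_start, temp_end, note_start, note_end):
    """Distance d of equation (H) between a template and a note transition."""
    temp_interval = abs(temp_start - temp_end)
    temp_ave = (temp_start + temp_end) / 2
    interval = abs(note_start - note_end)
    ave = (note_start + note_end) / 2
    return 0.8 * abs(temp_interval - interval) + 0.2 * abs(temp_ave - ave)


def nearest_nn_template(templates, note_start, note_end):
    """Return the template with the smallest distance d."""
    return min(
        templates,
        key=lambda tpl: nn_distance(tpl["start"], tpl["end"], note_start, note_end),
    )


if __name__ == "__main__":
    # Hypothetical templates; pitches are in cents.
    templates = [
        {"name": "A", "start": 6000, "end": 6200},
        {"name": "B", "start": 6700, "end": 7200},
    ]
    print(nearest_nn_template(templates, 6000, 6500)["name"])
```

In this example template A starts at the same pitch as the note transition, but template B has the same 500-cent interval, so the heavier weight on the interval term makes B the nearer template.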

[0117] At Step SA4 the start and end times of the area having the same attribute of the note track at the current time
t are acquired. If the phoneme track is stationary, in accordance with distinguishment between note attack, note-to-note
and note release, the feature parameters at the start time, end time or at the start and end times are acquired or
calculated. Thereafter, the flow advances to the next Step SA5.

[0118] If the note track at the time t is note attack, the Timbre database TDB is searched to read feature parameters
having the coincident phoneme name and the coincident pitch at the note attack end time.

[0119] If there is no feature parameter having the coincident pitch, two sets of feature parameters having the coincident
phoneme name and the pitches sandwiching the pitch at the note attack end time are acquired. The two sets
of feature parameters are interpolated to calculate the feature parameters at the note attack end time. The details of
interpolation will be given later.

[0120] If the note track at the time t is note release, the Timbre database TDB is searched to read feature parameters
having the coincident phoneme name and the coincident pitch at the note release start time.
[0121] If there is no feature parameter having the coincident pitch, two sets of feature parameters having the coincident
phoneme name and the pitches sandwiching the pitch at the note release start time are acquired. The two sets
of feature parameters are interpolated to calculate the feature parameters at the note release start time. The details of
interpolation will be given later.

[0122] If the note track at the time t is note-to-note, the Timbre database TDB is searched to read feature parameters
having the coincident phoneme name and the coincident pitch at the note-to-note start (end) time.

[0123] If there is no feature parameter having the coincident pitch, two sets of feature parameters having the coincident
phoneme name and the pitches sandwiching the pitch at the note-to-note start (end) time are acquired. The two
sets of feature parameters are interpolated to calculate the feature parameters at the note-to-note start (end) time. The
details of interpolation will be given later.
[0124] If the phoneme track is articulation, the feature parameters at the start and end times are acquired or calculated.
In this case, the Timbre database TDB is searched to read feature parameters having the coincident phoneme
names and the coincident pitch at the articulation start time and feature parameters having the coincident phoneme
names and the coincident pitch at the articulation end time.

[0125] If there is no feature parameter having the coincident pitch, two sets of feature parameters having the coincident
phoneme names and the pitches sandwiching the pitch at the articulation start (end) time are acquired. The two
sets of feature parameters are interpolated to calculate the feature parameters at the articulation start (end) time.
[0126] At Step SA5, the template read at Step SA3 is applied to the feature parameters and pitches at the start and 
end times read at Step SA4 to obtain the pitch and dynamics at the time t. 

[0127] If the note track at the time t is note attack, the NA template is applied to the note attack part by Type 2 by
using the feature parameters of the note attack part at the end time read at Step SA4. After the template is applied,
the pitch and dynamics (EGain) at the time t are stored.

[0128] If the note track at the time t is note release, the NR template is applied to the note release part by Type 1 by 
using the feature parameters of the note release part at the note release start point read at Step SA4. After the template 
is applied, the pitch and dynamics (EGain) at the time t are stored. 
[0129] If the note track at the time t is note-to-note, the NN template is applied to the note-to-note part by Type 3 by
using the feature parameters of the note-to-note start and end times read at Step SA4. After the template is applied,
the pitch and dynamics (EGain) at the time t are stored.

[0130] If the note track at the time t is none of the above-described parts, the pitch and dynamics (EGain) of the input
data Score are stored.

[0131] After one of the above-described processes is performed, the flow advances to the next Step SA6.
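
The branching performed at Step SA5 amounts to a small lookup from the note track state to a note template and an applying type. A minimal sketch is given below; the state strings, the tuple layout and the function name are assumptions introduced only for illustration.

```python
# Note track state -> (template, applying type, reference point(s) read at Step SA4)
NOTE_TEMPLATE_RULES = {
    "note_attack": ("NA", 2, "end"),
    "note_release": ("NR", 1, "start"),
    "note_to_note": ("NN", 3, "start_and_end"),
}


def select_note_template(note_state):
    """Return the (template, type, reference) rule for the note track state,
    or None when the pitch and dynamics of the input data Score are stored
    as they are."""
    return NOTE_TEMPLATE_RULES.get(note_state)


if __name__ == "__main__":
    for state in ("note_attack", "note_release", "note_to_note", "sustain"):
        print(state, select_note_template(state))
```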

[0132] At Step SA6 it is judged from the values of each track obtained at Step SA2 whether the phoneme at the time 
t is articulation or not. If the phoneme is articulation, the flow branches to Step SA9 indicated by a YES arrow, whereas 
if not, i.e., if the phoneme at the time t is stationary, the flow advances to Step SA7 indicated by a NO arrow. 
[0133] At Step SA7 the feature parameters are read from the Timbre database TDB by using as indices the phoneme
name obtained at Step SA2 and the pitch and dynamics obtained at Step SA5. The feature parameters are used for
interpolation. A read and interpolation method is similar to that used at Step SA4. Thereafter, the flow advances to
Step SA8.

[0134] At Step SA8 the stationary template obtained at Step SA3 is applied to the feature parameters and pitch at
the time t obtained at Step SA7 by Type 4.
[0135] By applying the stationary template at Step SA8, the feature parameters and pitch at the time t are renewed
to add voice fluctuation given by the stationary template. Thereafter, the flow advances to Step SA10.

[0136] At Step SA9 the articulation template read at Step SA3 is applied to the feature parameters in the articulation
part obtained at Step SA4 at the start and end times to obtain the feature parameters and pitch at the time t. Thereafter,
the flow advances to Step SA10.
[0137] In applying the template, Type 1 is used for a transition from a voiced sound (V) to an unvoiced sound (U),
Type 2 is used for a transition from an unvoiced sound (U) to a voiced sound (V), and Type 3 is used for a transition
from a voiced sound (V) to a voiced sound (V) or a transition from an unvoiced sound (U) to an unvoiced sound (U).

[0138] The template applying method is selectively used in the manner described above in order to realize a natural
voice change contained in the template while maintaining continuity of the voiced sound part.
[0139] At Step SA10 one of the NA template, NR template and NN template is applied to the feature parameters
obtained at Step SA8 or SA9. The template is not applied to EGain of the feature parameters. Thereafter, the flow
advances to Step SA11 whereat the feature parameter generating process is terminated.
[0140] In applying the template at Step SA10, if the note track at the time t is note attack, the NA template obtained
at Step SA3 is applied by Type 2 to renew the feature parameters.

[0141] If the note track at the time t is note release, the NR template obtained at Step SA3 is applied by Type 1 to
renew the feature parameters.

[0142] If the note track at the time t is note-to-note, the NN template obtained at Step SA3 is applied by Type 3 to
renew the feature parameters.

[0143] If the note track at the time t is none of the above-described parts, the template is not applied to EGain of the
feature parameters. The pitch obtained before Step SA10 is directly used.

[0144] Interpolation for feature parameters to be performed at Step SA4 shown in Fig. 9 will be described. Interpolation
for feature parameters includes interpolation of two sets of feature parameters and estimation from one set of
feature parameters.

[0145] It is known that if the pitch is changed when a person utters a voice, the glottal waveform (sound source
waveform generated by air from the lung and vibration of the vocal cord) changes, and that the formants change with
the pitch. If feature parameters obtained from voices at one pitch are directly used for synthesizing voices at another
pitch, synthesized voices have a tone color like that of the original voices even if the pitch is changed and are unnatural.
[0146] In order to avoid this, feature parameters are stored in the Timbre database TDB by selecting about three
points at an equal interval on the logarithmic axis of the compass of two to three octaves corresponding to the human
singing compass. In order to synthesize voices at a pitch different from the pitches stored in the Timbre database TDB,
the feature parameters are obtained through interpolation (linear interpolation) of two sets of feature parameters or
estimation (extrapolation) from one set of feature parameters.

[0147] By using this method, a change in feature parameters of voices at different pitches can be expressed mimetically.
Feature parameters at different pitches are prepared at about three points. The reason for this is as follows.
Even if a voice has the same phoneme and pitch, the feature parameters change with time. Therefore, a difference
between interpolation at about three points and interpolation at finely divided points is less meaningful.
[0148] In the interpolation by two sets of feature parameters, the feature parameters at a pitch f [cents] at the time
t can be obtained by linear interpolation by using the following equation (I) when the two sets of feature parameters
and a pair of pitches {P1, f1 [cents]} and {P2, f2 [cents]} are given:

$$P = P_1 + \frac{f - f_1}{f_2 - f_1}\,(P_2 - P_1) \tag{I}$$
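
The read-or-interpolate behavior described for the Timbre database TDB (use the entry with the coincident pitch if it exists, otherwise interpolate linearly between the two entries whose pitches sandwich the target) can be sketched as below. The list-of-pairs layout, the scalar feature value and the name `lookup_feature` are assumptions for illustration only; a pitch outside the stored compass is simply rejected here because it calls for estimation rather than interpolation.

```python
from bisect import bisect_left


def lookup_feature(entries, pitch):
    """entries: list of (pitch_cents, value) pairs for one phoneme,
    sorted by pitch. Returns the stored or linearly interpolated value."""
    pitches = [p for p, _ in entries]
    i = bisect_left(pitches, pitch)
    if i < len(pitches) and pitches[i] == pitch:
        return entries[i][1]  # coincident pitch found
    if i == 0 or i == len(pitches):
        raise ValueError("pitch outside the stored compass; estimation is needed")
    (f1, p1), (f2, p2) = entries[i - 1], entries[i]  # sandwiching entries
    return p1 + (pitch - f1) / (f2 - f1) * (p2 - p1)


if __name__ == "__main__":
    # Hypothetical entries at three pitches one octave apart.
    entries = [(4800, 500.0), (6000, 600.0), (7200, 800.0)]
    print(lookup_feature(entries, 6000))
    print(lookup_feature(entries, 5400))
    print(lookup_feature(entries, 6600))
```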



[0149] In the equation (I), only one pitch is used as the search parameter of the database. If N indices are used,
(N + 1) data in the nearby area surrounding the target are used to obtain the feature parameters to be used as a
substitute for the target index f from the following equation (I'):



?=Z^ — J — v.- (0 



where Pi is the i-th nearby feature parameter and fi is its index.

[0150] The estimation from one set of feature parameters is utilized when the feature parameters outside of the 
compass of data stored in the database are estimated. 

[0151] If the feature parameters having the highest pitch in the database are used for synthesizing voices having a
pitch higher than the compass of the database, the sound quality is apparently degraded.

[0152] If the feature parameters having the lowest pitch in the database are used for synthesizing voices having a
pitch lower than the compass of the database, the sound quality is also degraded. In this embodiment, therefore, the
sound quality is prevented from being degraded by changing the feature parameters in the following manner by using
rules based upon knowledge obtained from observations of actual voice data.
[0153] First, synthesizing voices having a pitch (target pitch) higher than the compass of the database will be described.

[0154] First, a value PitchDiff [cents] is calculated by subtracting the highest pitch HighestPitch [cents] in the database 
from the target pitch TargetPitch [cents]. 
[0155] Next, the feature parameters having the highest pitch are read from the database. Of the feature parameters,
the excitation resonance frequency EpRFreq and i-th formant frequency FormantFreqi are increased by PitchDiff [cents]
to obtain EpRFreq' and FormantFreqi' which are used as the feature parameters of the target pitch.
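
The rule for a target pitch above the compass can be sketched as follows. The sketch assumes that the frequencies are held on a cent scale, so that adding PitchDiff [cents] is meaningful, and it uses a plain dictionary with hypothetical keys for the feature parameters.

```python
def extrapolate_above(params, target_pitch, highest_pitch):
    """Shift EpRFreq and every formant frequency by PitchDiff [cents].

    params: dict with "EpRFreq" (float) and "FormantFreq" (list of floats),
    both assumed to be expressed in cents. A new dict is returned.
    """
    pitch_diff = target_pitch - highest_pitch
    out = dict(params)
    out["EpRFreq"] = params["EpRFreq"] + pitch_diff
    out["FormantFreq"] = [f + pitch_diff for f in params["FormantFreq"]]
    return out


if __name__ == "__main__":
    stored = {"EpRFreq": 7000.0, "FormantFreq": [8000.0, 9000.0]}
    print(extrapolate_above(stored, target_pitch=7500.0, highest_pitch=7200.0))
```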
[0156] Next, synthesizing voices having a pitch (target pitch) lower than the compass of the database will be described.

[0157] First, a value PitchDiff [cents] is calculated by subtracting the lowest pitch LowestPitch [cents] in the database
from the target pitch TargetPitch [cents].

[0158] Next, the feature parameters having the lowest pitch are read from the database. The feature parameters are 
replaced in the following manner to use the replaced feature parameters as the feature parameters at the target pitch. 
[0159] First, the excitation resonance frequency EpRFreq and first to fourth formant frequencies FormantFreqi
(1 ≤ i ≤ 4) are replaced by EpRFreq' and FormantFreqi' by using the following equations (J1) and (J2):

$$\mathit{ERFreq}' = \mathit{ERFreq} + 0.25 \times \mathit{PitchDiff} \tag{J1}$$

$$\mathit{FormantFreq}_i' = \mathit{FormantFreq}_i + 0.25 \times \mathit{PitchDiff} \tag{J2}$$

[0160] In order to make the band width narrower as the pitch becomes lower, the excitation resonance band width
ERBW and first to fourth formant band widths FormantBWi (1 ≤ i ≤ 3) are replaced by ERBW' and FormantBWi' by
using the following equations (J3) and (J4):

ERBW> = 1-3 x5Sf/ 1200 < J3 > 

FormantFreq; = FormantFreqj + 0.25 x PitchDiff (J4) 

[0161] The first to fourth formant amplitudes FormantAmp1 to FormantAmp4 are made large in proportion to PitchDiff
by using the following equations (J5) to (J8) to be replaced by FormantAmp1' to FormantAmp4':

$$\mathit{FormantAmp}_1' = \mathit{FormantAmp}_1 - 8 \times \mathit{PitchDiff} / 1200 \tag{J5}$$

$$\mathit{FormantAmp}_2' = \mathit{FormantAmp}_2 - 5 \times \mathit{PitchDiff} / 1200 \tag{J6}$$

$$\mathit{FormantAmp}_3' = \mathit{FormantAmp}_3 - 12 \times \mathit{PitchDiff} / 1200 \tag{J7}$$

$$\mathit{FormantAmp}_4' = \mathit{FormantAmp}_4 - 15 \times \mathit{PitchDiff} / 1200 \tag{J8}$$

[0162] The slope ESlope of the spectrum envelope is replaced by ESlope' by using the following equation (J9):

$$\mathit{ESlope}' = \mathit{ESlope} \times (1 - 4 \times \mathit{PitchDiff} / 1200) \tag{J9}$$
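
The replacement rules for a target pitch below the compass can be collected in one function. The sketch below covers only the frequency rules (J1) and (J2), the amplitude rules (J5) to (J8) and the slope rule (J9); the band widths are left untouched in this sketch. The dictionary keys and the function name are assumptions for illustration.

```python
AMP_COEFFS = (8.0, 5.0, 12.0, 15.0)  # coefficients of (J5) to (J8)


def extrapolate_below(params, target_pitch, lowest_pitch):
    """Apply (J1), (J2), (J5)-(J8) and (J9) for a pitch below the compass.

    params: dict with "ERFreq", "FormantFreq" (4 values),
    "FormantAmp" (4 values) and "ESlope". A new dict is returned.
    """
    pitch_diff = target_pitch - lowest_pitch  # negative below the compass
    octaves = pitch_diff / 1200.0
    out = dict(params)
    out["ERFreq"] = params["ERFreq"] + 0.25 * pitch_diff
    out["FormantFreq"] = [f + 0.25 * pitch_diff for f in params["FormantFreq"]]
    out["FormantAmp"] = [a - c * octaves
                         for a, c in zip(params["FormantAmp"], AMP_COEFFS)]
    out["ESlope"] = params["ESlope"] * (1 - 4 * octaves)
    return out


if __name__ == "__main__":
    stored = {
        "ERFreq": 6000.0,
        "FormantFreq": [7000.0, 8000.0, 9000.0, 10000.0],
        "FormantAmp": [0.0, 0.0, 0.0, 0.0],
        "ESlope": 1.0,
    }
    # One octave (1200 cents) below the lowest stored pitch.
    print(extrapolate_below(stored, target_pitch=3600.0, lowest_pitch=4800.0))
```

Because PitchDiff is negative below the compass, the minus signs in (J5) to (J8) raise the formant amplitudes, which matches the statement that they are made large.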

[0163] It is preferable to form the Timbre database TDB shown in Fig. 4 using the pitch, dynamics and opening as
indices. However, if there are restrictions of time and database size, the database of this embodiment shown in Fig. 3
using only the pitch as the index is used.

[0164] The feature parameters using only the pitch as the index are changed by using a dynamics function and an
opening function. In this case, the effects of using the Timbre database TDB using the pitch, dynamics and opening
as indices can be obtained mimetically.

[0165] Namely, by using voices recorded by changing only the pitch, voices can be obtained as if they were recorded
by changing the pitch, dynamics and opening. The dynamics function and opening function can be
obtained by analyzing a correlation between the feature parameters and the actual voices vocalized by changing the
dynamics and opening.

[0166] Figs. 10A to 10C are graphs showing examples of the dynamics function. Fig. 10A is a graph showing a
function fEG, Fig. 10B is a graph showing a function fES, and Fig. 10C is a graph showing a function fESD.
[0167] By using the functions fEG, fES and fESD shown in Figs. 10A to 10C, the dynamics value is reflected upon
the feature parameters ExcitationGain (EG), ExcitationSlope (ES) and ExcitationSlopeDepth (ESD).
[0168] All of the functions fEG, fES and fESD shown in Figs. 10A to 10C are input with a dynamics value which takes
a value from 0 to 1. The feature parameters EG', ES' and ESD' are calculated by the following equations (K1) to (K3)
by using the functions fEG, fES and fESD to use as the feature parameters at the dynamics value dyn:

$$EG' = f_{EG}(\mathit{dyn}) \tag{K1}$$

$$ES' = ES \times f_{ES}(\mathit{dyn}) \tag{K2}$$

$$ESD' = ESD + f_{ESD}(\mathit{dyn}) \tag{K3}$$



[0169] The functions fEG, fES and fESD shown in Figs. 10A to 10C are only illustrative. By using various functions 
for singers, voices having more naturalness can be synthesized. 
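
How equations (K1) to (K3) combine the three dynamics functions can be seen in a short sketch. The curves of Figs. 10A to 10C are not reproduced in the text, so the three functions below are purely hypothetical placeholders; only the way they are applied (replacement for EG, multiplication for ES, addition for ESD) follows the equations.

```python
def f_eg(dyn):
    """Hypothetical stand-in for fEG: gain from -30 at dyn = 0 to 0 at dyn = 1."""
    return -30.0 * (1.0 - dyn)


def f_es(dyn):
    """Hypothetical stand-in for fES: slope multiplier from 0.5 to 1.5."""
    return 0.5 + dyn


def f_esd(dyn):
    """Hypothetical stand-in for fESD: additive depth offset from 0 to 10."""
    return 10.0 * dyn


def apply_dynamics(params, dyn):
    """Apply (K1) to (K3) to a dict holding "EG", "ES" and "ESD"."""
    if not 0.0 <= dyn <= 1.0:
        raise ValueError("the dynamics value must lie between 0 and 1")
    out = dict(params)
    out["EG"] = f_eg(dyn)                    # (K1): EG is replaced
    out["ES"] = params["ES"] * f_es(dyn)     # (K2): ES is scaled
    out["ESD"] = params["ESD"] + f_esd(dyn)  # (K3): ESD is offset
    return out


if __name__ == "__main__":
    print(apply_dynamics({"EG": -5.0, "ES": 2.0, "ESD": 1.0}, dyn=0.5))
```

Swapping in a different set of three functions per singer, as the text suggests, would only change `f_eg`, `f_es` and `f_esd`.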
[0170] Fig. 11 is a graph showing an example of the opening function. In Fig. 11, the horizontal axis represents a
frequency (Hz) and the vertical axis represents an amplitude (dB).

[0171] An excitation resonance frequency ERFreq' is obtained from the excitation resonance frequency ERFreq by
using the following equation (L1) to use it as the feature parameters at the opening value Open:

$$\mathit{ERFreq}' = \mathit{ERFreq} + f_{Open}(\mathit{ERFreq}) \times (1 - \mathit{Open}) \tag{L1}$$

where fOpen(freq) is the opening function.

[0172] An i-th formant frequency FormantFreqi' is obtained from the i-th formant frequency FormantFreqi by using
the following equation (L2) to use it as the feature parameters at the opening value Open:

$$\mathit{FormantFreq}_i' = \mathit{FormantFreq}_i + f_{Open}(\mathit{FormantFreq}_i) \times (1 - \mathit{Open}) \tag{L2}$$

[0173] In this manner, the amplitudes of formants in the frequency range from 0 to 500 Hz can be increased or
decreased in proportion to the opening value so that synthesized voices can be given a change in voice to be caused
by the lip opening degree.

[0174] Synthesized voices can be changed in various ways by preparing the functions to be input with opening values
for each singer and changing the functions.
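
The opening correction can be sketched in the form in which equations (L1) and (L2) are written, namely a value of the opening function scaled by (1 - Open) and added to each frequency. The curve of Fig. 11 is not given numerically, so `f_open` below is a hypothetical ramp that is non-zero only below 500 Hz; the function names are likewise assumptions.

```python
def f_open(freq_hz):
    """Hypothetical opening function: a ramp that vanishes at and above 500 Hz."""
    if freq_hz >= 500.0:
        return 0.0
    return -50.0 * (1.0 - freq_hz / 500.0)


def apply_opening(er_freq, formant_freqs, opening):
    """Apply (L1) to the excitation resonance frequency and (L2) to each
    formant frequency for an opening value between 0 and 1."""
    scale = 1.0 - opening
    new_er = er_freq + f_open(er_freq) * scale
    new_formants = [f + f_open(f) * scale for f in formant_freqs]
    return new_er, new_formants


if __name__ == "__main__":
    print(apply_opening(250.0, [250.0, 1000.0], opening=0.0))
    print(apply_opening(250.0, [250.0, 1000.0], opening=1.0))
```

With an opening value of 1 the factor (1 - Open) is zero, so the stored parameters pass through unchanged; the correction grows as the opening value decreases.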
[0175] Fig. 12 is a diagram illustrating an example of a first application of templates according to the embodiment.
Voices of a song shown by a score at (a) in Fig. 12 are synthesized by the embodiment method.
[0176] In this score, the pitch of the first half note is "so", the intensity is "piano (soft)", and the pronunciation is "a".
The pitch of the second half note is "do", the intensity is "mezzo-forte (somewhat loud)", and the pronunciation is "a".
Since the two notes are concatenated by legato, two voices are smoothly concatenated without any pause.
[0177] It is assumed that a transition time from "so" to "do" is given within the input data (score).

[0178] First, the frequencies of two pitches are given from the sound names of the notes. Thereafter, the end and 
start points of the two pitches are interconnected by a straight line to obtain the pitches in the boundary area between 
the notes as indicated at (b) in Fig. 12. 

[0179] Values corresponding to the intensity symbols such as "piano (soft)" and "mezzo-forte (somewhat loud)" are 
stored beforehand in a table. By using this table, the intensity symbol is converted into the intensity value to obtain 
dynamics values of the two notes. By interconnecting the obtained two dynamics values, the dynamics values in the 
boundary area between the notes as indicated at (b) in Fig. 12 can be obtained. 
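
The table lookup and straight-line interconnection described above can be sketched as follows. The numeric dynamics values assigned to the intensity symbols are invented for the example, as are the function names; only the procedure (symbol to value by table, then a straight line across the boundary area) follows the text.

```python
# Hypothetical intensity table: symbol -> dynamics value in the range 0 to 1.
INTENSITY_TABLE = {"piano": 0.25, "mezzo-forte": 0.75}


def ramp(start_value, end_value, n_frames):
    """Straight line from start_value to end_value over n_frames frames."""
    if n_frames < 2:
        raise ValueError("at least two frames are needed")
    step = (end_value - start_value) / (n_frames - 1)
    return [start_value + step * i for i in range(n_frames)]


def boundary_dynamics(first_symbol, second_symbol, n_frames):
    """Dynamics values in the boundary area between two notes."""
    return ramp(INTENSITY_TABLE[first_symbol],
                INTENSITY_TABLE[second_symbol],
                n_frames)


if __name__ == "__main__":
    print(boundary_dynamics("piano", "mezzo-forte", 3))
```

The same `ramp` helper could serve for the pitches in the boundary area once the two note names have been converted to frequencies.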
[0180] If the pitches and dynamics values obtained in the above manner are used, the pitches and dynamics change
abruptly in the boundary area. In order to concatenate the notes by legato, the NN template is applied to the boundary
area as indicated at (b) in Fig. 12.

[0181] In this case, the NN template is applied only to the pitches and dynamics to obtain pitches and dynamics 
which smoothly concatenate the boundary area between two notes as indicated at (c) in Fig. 12. 
[0182] Next, by using the pitches and dynamics determined as indicated at (c) in Fig. 12 and the phoneme name "a" 
as indices, the feature parameters at each timing are obtained from the Timbre database TDB as indicated at (d) in 
Fig. 12. 

[0183] The stationary template corresponding to the phoneme name "a" as indicated at (d) in Fig. 12 is applied to 
the feature parameters at each timing to add voice fluctuation to the stationary parts other than the concatenated points 
at the boundaries of the notes and obtain the feature parameters as indicated at (e) in Fig. 12. 
[0184] The NN template for the remaining parameters (such as formant frequencies) excepting the pitches and dy- 
namics applied as indicated at (b) in Fig. 12 is applied to the feature parameters indicated at (e) in Fig. 12 to add 
fluctuation to the formant frequencies and the like in the boundary area between the notes as indicated at (f) in Fig. 12. 
[0185] Lastly, by using the pitches and dynamics indicated at (c) in Fig. 12 and the feature parameters indicated at 
(f), voices are synthesized so that the song of the score indicated at (a) can be synthesized. 

[0186] The time width of the NN template as indicated at (b) in Fig. 12 can be broadened, for example, as shown in
Fig. 13. As shown in Fig. 13, as the time width of the NN template is broadened, the stretched NN template is applied
so that voices of a song can be synthesized having a gentle change.

[0187] Conversely, if the time width of the NN template is narrowed, voices of a song can be synthesized having a 
quick and smooth change. By controlling the application time of the NN template, the transition speed can be controlled. 
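
Stretching or shrinking one template to a different time width can be done by resampling it. The specification does not state how the template is resampled, so the linear resampling below is only one possible choice, and the function name is hypothetical.

```python
def stretch_template(template, new_length):
    """Resample a template (list of per-frame values) to new_length frames
    by linear interpolation between neighboring frames."""
    if new_length < 2 or len(template) < 2:
        raise ValueError("both lengths must be at least 2")
    scale = (len(template) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * scale                          # position on the original time axis
        lo = min(int(pos), len(template) - 2)    # left neighbor frame
        frac = pos - lo
        out.append(template[lo] + frac * (template[lo + 1] - template[lo]))
    return out


if __name__ == "__main__":
    print(stretch_template([0.0, 2.0, 4.0], 5))            # broadened
    print(stretch_template([0.0, 1.0, 2.0, 3.0, 4.0], 3))  # narrowed
```

A longer `new_length` gives the gentle change described for Fig. 13, and a shorter one gives the quicker transition.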
[0188] Even if the pitch is changed from one frequency to another frequency in the same time period, there are
different singing methods of changing quickly in the first half part and changing slowly in the last half, or vice versa.
There are several different pitch change methods, and this difference results in a musical listening difference. If a
plurality of types of NN templates are formed from voices vocalized in different ways of legato, synthesized voices can
have many variations.

[0189] There are many methods of changing the pitch including legato. Templates for these voices may also be
recorded.

[0190] For example, there is glissando by which the pitch is changed at each halftone or the pitch is changed stepwise
only at the scale of a key of a song (e.g., in C major, do, re, mi, fa, so, la, ti, do), as different from legato by which the
pitch is changed perfectly continuously.

[0191] If an NN template is formed from actual voices vocalized by glissando and applied to voices, voices concate- 
nating two notes smoothly can be synthesized. 

[0192] In this embodiment, the NN template used is formed from voices of the same phoneme and different pitches.
An NN template may be formed from voices of different phonemes such as from "a" to "e" and different pitches. In this 
case, although the number of NN templates increases, synthesized voices can be made more like actual voices of a 
song. 

[0193] Fig. 14 is a diagram illustrating an example of a second application of templates according to the embodiment.
Voices of a song shown by a score at (a) in Fig. 14 are synthesized by the embodiment method.
[0194] In this score, the pitch of the first half note is "so", the intensity is "piano (soft)", and the pronunciation is "a". 
The pitch of the second half note is "do", the intensity is "mezzo-forte (somewhat loud)", and the pronunciation is "e". 
[0195] It is assumed that an articulation time from "a" to "e" is set to a fixed value for each of the combinations of 
two phonemes, or given when the input data is given. 

[0196] First, the frequencies of two pitches are given from the pitch names of the notes. Thereafter, the end and start
points of the two pitches are interconnected by a straight line to obtain the pitches in the boundary area between the 
notes as indicated at (b) in Fig. 14. 

[0197] Values corresponding to the intensity symbols such as "piano (soft)" and "mezzo-forte (somewhat loud)" are 
stored beforehand in a table. By using this table, the intensity symbol is converted into the intensity value to obtain 
dynamics values of the two notes. By interconnecting the obtained two dynamics values, the dynamics values in the 
boundary area between the notes as indicated at (b) in Fig. 14 can be obtained. 

[0198] Next, by using the pitches and dynamics determined as indicated at (b) in Fig. 14 and the phoneme names 
"a" and "e" as indices, the feature parameters at each timing are obtained from the Timbre database TDB as indicated 
at (c) in Fig. 14. The feature parameters in the articulation part are obtained by linear interpolation, for example, by 
using a straight line interconnecting the end point of the phoneme "a" and the start point of the phoneme "e". 
[0199] Next, as indicated at (c) in Fig. 14, a stationary template of "a", an articulation template from "a" to "e" and a 
stationary template of "e" are applied to the corresponding ones of the feature parameters to obtain feature parameters 
as indicated at (d) in Fig. 14. 

[0200] Lastly, by using the pitches and dynamics indicated at (b) in Fig. 14 and the feature parameters indicated at
(d), voices are synthesized.

[0201] We can synthesize voices of the song capable of changing naturally from "a" to "e" similar to actual voices 
sung by a singer. 

[0202] Similar to the NN template, if the length of the boundary area (articulation part) is given within the score, the
articulation time from "a" to "e" can be controlled, and voices changing slowly or voices changing quickly can be
synthesized by stretching or shrinking one template. The phoneme transition time can therefore be controlled.
[0203] Fig. 15 is a diagram illustrating an example of a third application of templates according to the embodiment.
Voices of a song shown by a score at (a) in Fig. 15 are synthesized by the embodiment method.
[0204] In this score, the pitch of the whole note is "so", the pronunciation is "a", and the intensity of the whole note
is gradually raised in the rising part and gradually lowered in the falling part.

[0205] The pitches and dynamics are flat as indicated at (b) in Fig. 15. The NA template is applied to
the start of the pitches and dynamics, and the NR template is applied to the end of the note, to thereby obtain and
determine the pitches and dynamics as indicated at (c) in Fig. 15.

[0206] It is assumed that the lengths of the NA template and NR template to be applied are input directly from the
crescendo symbol and decrescendo symbol.

[0207] Next, by using the determined pitches and dynamics indicated at (c) in Fig. 15 and the phoneme name "a"
as indices, the feature parameters in the intermediate part which is neither the attack part nor the release part are
obtained as indicated at (d) in Fig. 15.

[0208] The stationary template is applied to the feature parameters in the intermediate part indicated at (d) in Fig. 15
to obtain the feature parameters given fluctuation as indicated at (e) in Fig. 15. By using these feature parameters
indicated at (e) in Fig. 15, the feature parameters in the attack part and release part are obtained.

[0209] The feature parameters in the attack part are obtained by applying the NA template of the phoneme "a" by
Type 2 to the start point of the intermediate part (end point of the attack part).

[0210] The feature parameters in the release part are obtained by applying the NR template of the phoneme "a" by
Type 1 to the end point of the intermediate part (start point of the release part).

[0211] In the above manner, the feature parameters in the attack, intermediate and release parts are obtained as
indicated at (f) in Fig. 15. By using these feature parameters and the pitches and dynamics indicated at (c) in Fig. 15,
voices of the song shown by the score indicated at (a) in Fig. 15 and sung by crescendo and decrescendo can be synthesized.

[0212] According to the embodiment, the feature parameters are modified by using phoneme templates obtained by
analyzing actual voices sung by a singer. It is therefore possible to generate natural synthesized voices retaining the
characteristics of a stretched vowel part and a phonetic transition of voices of the song.

[0213] According to the embodiment, the feature parameters are modified by using phoneme templates obtained by
analyzing actual voices sung by a singer. It is therefore possible to generate synthesized voices having musical intensity
expression that is not a mere volume difference.

[0214] According to the embodiment, even if data providing finely changed musical expression such as pitches,
dynamics and opening is not prepared, other data can be used through interpolation. Therefore, the number of samples
can be reduced, so that the size of a database can be made small and the time for forming the database can be
shortened.

[0215] According to the embodiment, even if the database using as an index only the pitch as musical expression
is used, the effects of a database using as indices three musical expressions including pitch, dynamics and opening
can be obtained mimetically by using the opening and dynamics functions.

[0216] In this embodiment, as shown in Fig. 2, although the input data Score is constituted of the phoneme track PHT,
note track NT, pitch track PIT, dynamics track and opening track OT, the structure of the input data Score is not limited
only thereto. For example, a vibrato track may be added to the input data Score shown in Fig. 2. The vibrato track records
a vibrato value from 0 to 1.
[0217] In this case, a function that returns a sequence of pitches and dynamics by using a vibrato value as an argument,
or a table of vibrato templates, is stored in the database 4.

[0218] In calculating the pitches and dynamics at Step SA5 shown in Fig. 9, the vibrato template is applied so that
pitches and dynamics added with the vibrato effects can be obtained.
[0219] The vibrato template can be obtained by analyzing actual human singing voice.

[0220] Although the embodiment has been described mainly with respect to singing voice synthesis, the embodiment
is not limited only thereto. Voices of general conversation and sounds of musical instruments may also be synthesized.

[0221] The embodiment may be realized by a computer or the like installed with a computer program and the like.

[0222] In this case, the computer program and the like realizing the embodiment functions may be stored in a computer
readable storage medium such as a CD-ROM and a floppy disc to distribute it to a user.
[0223] If the computer and the like are connected to the communication network such as a LAN. the Internet and a 
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telephone line, the computer program, data and the like may be supplied via the communication network. 
[0224] The present invention has been described in connection with the preferred embodiments. The invention is 
not limited only to the above embodiments. It is apparent that various modifications, improvements, combinations, and
the like can be made by those skilled in the art. 



Claims 

1. A voice synthesizing apparatus comprising:

means for storing phoneme pieces having a plurality of different pitches for each phoneme represented by a 
same phoneme symbol; 

means for reading a phoneme piece by using a pitch as an index; and 

a voice synthesizer that synthesizes a voice in accordance with the read phoneme piece. 

2. A voice synthesizing apparatus comprising: 

means for storing phoneme pieces having a plurality of different musical expressions for each phoneme
represented by a same phoneme symbol;

means for reading a phoneme piece by using the musical expression as an index; and

means for synthesizing a voice in accordance with the read phoneme piece. 

3. A voice synthesizing apparatus comprising: 

means for storing a plurality of different phoneme pieces for each phoneme represented by a same phoneme
symbol;

means for inputting voice information for voice synthesis; 

means for calculating a phoneme piece matching the voice information by interpolation using the phoneme 
pieces stored in said means for storing, if the phoneme piece matching the voice information is not stored in 
said means for storing; and

means for synthesizing a voice in accordance with the phoneme piece calculated through interpolation. 

4. A voice synthesizing apparatus comprising:

means for storing a change amount of a voice feature parameter as template data;

means for inputting voice information for voice synthesis; 

means for reading the template data from said memory in accordance with the voice information; and 
means for synthesizing a voice in accordance with the read template data and the voice information. 

5. A voice synthesizing method comprising, the steps of:

a) reading a phoneme piece by using a pitch as an index from a storage medium that stores phoneme pieces 
having a plurality of different pitches for each phoneme represented by a same phoneme symbol; and 

b) synthesizing a voice in accordance with the read phoneme piece. 


6. A voice synthesizing method comprising, the steps of: 

a) reading a phoneme piece by using a musical expression as an index from a storage medium that stores
phoneme pieces having a plurality of different musical expressions for each phoneme represented by a same
phoneme symbol; and

b) synthesizing a voice in accordance with the read phoneme piece. 

7. A voice synthesizing method comprising, the steps of: 

a) reading a phoneme piece from a memory that stores a plurality of different phoneme pieces for each
phoneme represented by a same phoneme symbol;

b) inputting voice information for voice synthesis; 

c) calculating a phoneme piece matching the voice information by interpolation using the phoneme pieces
stored in said memory, if the phoneme piece matching the voice information is not stored in said memory; and

d) synthesizing a voice in accordance with the phoneme piece calculated through interpolation.

8. A voice synthesizing method comprising, the steps of: 

a) storing a change amount of a voice feature parameter as template data in a memory; 

b) inputting voice information for voice synthesis;

c) reading the template data from said memory in accordance with the voice information; and

d) synthesizing a voice in accordance with the read template data and the voice information. 

9. A program that a computer executes to realize a voice synthesizing process comprising, the instructions of: 

a) reading a phoneme piece by using a pitch as an index from a storage medium that stores phoneme pieces 
having a plurality of different pitches for each phoneme represented by a same phoneme symbol; and

b) synthesizing a voice in accordance with the read phoneme piece. 

10. A program that a computer executes to realize a voice synthesizing process comprising, the instructions of: 

a) reading a phoneme piece by using a musical expression as an index from a storage medium that stores
phoneme pieces having a plurality of different musical expressions for each phoneme represented by a same
phoneme symbol; and

b) synthesizing a voice in accordance with the read phoneme piece. 

11. A program that a computer executes to realize a voice synthesizing process comprising, the instructions of:

a) reading a phoneme piece from a memory that stores a plurality of different phoneme pieces for each pho- 
neme represented by a same phoneme symbol; 

b) inputting voice information for voice synthesis; 

c) calculating a phoneme piece matching the voice information by interpolation using the phoneme pieces 
stored in said memory, if the phoneme piece matching the voice information is not stored in said memory; and

d) synthesizing a voice in accordance with the phoneme piece calculated through interpolation. 

12. A program that a computer executes to realize a voice synthesizing process comprising, the instructions of:

a) storing a change amount of a voice feature parameter as template data in a memory;

b) inputting voice information for voice synthesis; 

c) reading the template data from said memory in accordance with the voice information; and

d) synthesizing a voice in accordance with the read template data and the voice information.

FIG. 3

FIG. 7

PHONEME NAME    REPRESENTATIVE PITCH [Hz]    FEATURE PARAMETER
/a/             200                          {Pa1(t), Pitch_a1(t), Ta1}
/a/             300                          {Pa2(t), Pitch_a2(t), Ta2}
/a/             400                          {Pa3(t), Pitch_a3(t), Ta3}
/i/             200                          {Pi1(t), Pitch_i1(t), Ti1}
/i/             300                          {Pi2(t), Pitch_i2(t), Ti2}
...             ...                          ...
/o/             500                          {Po4(t), Pitch_o4(t), To4}
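
Reading a phoneme piece from a table of this kind by using the pitch as an index may be sketched as follows. The sketch assumes that the entry whose representative pitch is nearest to the requested pitch is selected; this selection rule, the dictionary layout and the name read_phoneme_piece are hypothetical, and the strings stand in for the actual feature parameters.

```python
# Hypothetical in-memory form of a pitch-indexed timbre table.
# Each phoneme maps to (representative pitch in Hz, feature parameter label).
TIMBRE_DB = {
    "a": [(200, "Pa1"), (300, "Pa2"), (400, "Pa3")],
    "i": [(200, "Pi1"), (300, "Pi2")],
}


def read_phoneme_piece(db, phoneme, pitch_hz):
    """Return the stored entry whose representative pitch is nearest.

    Nearest-pitch selection is an assumption made for this sketch. When two
    entries are equally near, the one listed first is returned.
    """
    candidates = db[phoneme]
    return min(candidates, key=lambda entry: abs(entry[0] - pitch_hz))


piece = read_phoneme_piece(TIMBRE_DB, "a", 310)
```

The same phoneme symbol thus resolves to different phoneme pieces depending on the requested pitch.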



FIG. 8

PRECEDING       SUCCEEDING      REPRESENTATIVE    FEATURE PARAMETER
PHONEME NAME    PHONEME NAME    PITCH [Hz]
/a/             /i/             200               {Pai1(t), Pitch_ai1(t), Tai1}
/a/             /i/             400               {Pai2(t), Pitch_ai2(t), Tai2}
/a/             /s/             300               {Pas1(t), Pitch_as1(t), Tas1}
/a/             /s/             500               {Pas2(t), Pitch_as2(t), Tas2}
...             ...             ...               ...
/s/             /o/             500               {Pso3(t), Pitch_so3(t), Tso3}






EP 1 239 457 A2 



FIG. 9 (flowchart: FEATURE PARAMETER GENERATING PROCESS; step labels SA1 to SA11)

ACQUIRE VALUE OF EACH TRACK OF INPUT DATA Score AT TIME t

IN ACCORDANCE WITH EACH TRACK VALUE OF INPUT DATA Score, READ NECESSARY TEMPLATES FROM PHONEME TEMPLATE
DATABASE AND NOTE TEMPLATE DATABASE

ACQUIRE START & END TIMES OF AREA HAVING SAME ATTRIBUTE OF NOTE TRACK AT TIME t, AND ACQUIRE OR CALCULATE
FEATURE PARAMETERS AT START TIME, END TIME OR BOTH START & END TIMES

APPLY TEMPLATE TO FEATURE PARAMETERS & PITCHES AT START & END TIMES TO OBTAIN PITCH & DYNAMICS AT TIME t

PHONEME AT TIME t IS ARTICULATION ?

YES: APPLY ARTICULATION TEMPLATE TO FEATURE PARAMETERS IN ARTICULATION PART AT START & END TIMES TO OBTAIN
FEATURE PARAMETERS & PITCH AT TIME t

NO: READ FEATURE PARAMETERS FROM TIMBRE DATABASE BY USING AS INDICES PHONEME NAME, PITCH & DYNAMICS AT
TIME t AND INTERPOLATE THEM; APPLY STATIONARY TEMPLATE TO FEATURE PARAMETERS AND PITCH AT TIME t BY TYPE 4

APPLY ONE OF NA TEMPLATE, NR TEMPLATE AND NN TEMPLATE TO FEATURE PARAMETERS

RETURN






FIG. 10A

FIG. 10B

(plot axis: Dynamics, ticks 0.2 to 1.4)

FIG. 10C






FIG. 11 (plot; horizontal axis: Frequency [Hz])






FIG. 12 (diagram with panels (a) to (f) for the phoneme /a/; recoverable labels: a, /a/, mf, NoteToNote,
a (stationary), NoteToNote; PITCH, DYNAMICS, NN TEMPLATE; PITCH (DETERMINED), DYNAMICS (DETERMINED);
FEATURE PARAMETER, STATIONARY TEMPLATE; FEATURE PARAMETER, NN TEMPLATE; FEATURE PARAMETER (DETERMINED))






FIG. 13

FIG. 14

FIG. 15